perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot by aksOps · Pull Request #118 · RandomCodeSpace/codeiq

aksOps · 2026-05-04T17:13:45Z

Summary

OOM review of codeiq serve on AKS at the typical ~200 K-node graph
scale identified four cumulative offenders fighting for the same
cgroup memory limit. This PR addresses all four:

Topology snapshot deduplication. McpTools and TopologyController
each held an independent in-heap topology snapshot. Extracted a single
query/TopologySnapshotProvider (60 s TTL, idle-releaseable) shared by
both. The Snapshot record carries a loaded flag so the controller
can still distinguish "no source available" (404) from "graph is
empty" (200), preserving the legacy contract.
Spring cache. @EnableCaching was on but no CacheManager bean was
registered → unbounded ConcurrentMapCacheManager. Switched the
serving profile to Caffeine (maximumSize=1000, expireAfterWrite=5m).
Neo4j page cache. Capped at 256 MiB via
GraphDatabaseSettings.pagecache_memory so embedded Neo4j stops
auto-grabbing ~50 % of free RAM at startup.
AKS JVM flags. Added -XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError to scripts/aks-launch.sh
so the heap is pinned to half the cgroup limit, leaving room for
Neo4j + Metaspace + JIT + Tomcat NIO + OS slack.

Plus a runbook at shared/runbooks/aks-oom-quick-fix.md with the
diagnostic flow (OOMKilled vs readiness-flap) and the Deployment YAML
patch.

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %,
no more OOMKilled events, idle pod releases topology snapshot after
60 s.

Test plan

`mvn test -Dfrontend.skip=true` — 3706 / 3706 pass, 32 skipped (expected)
`mvn package -DskipTests -Dfrontend.skip=true` — clean build
Deploy to AKS staging, verify pod stays under `limits.memory: 4Gi`
Verify topology MCP tools (get_topology, blast_radius, find_path) still respond correctly under bearer auth

🤖 Generated with Claude Code

OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph scale identified four cumulative offenders fighting for the same cgroup memory limit: - McpTools and TopologyController each held an independent in-heap topology snapshot (~150 MB at this graph size). Under mixed REST + MCP traffic both lived on heap simultaneously. - TopologyController's snapshot had no TTL — once loaded, held for the lifetime of the process. - Spring `@EnableCaching` was on but no `CacheManager` bean was registered, so every `@Cacheable` region in QueryService fell back to ConcurrentMapCacheManager (unbounded, no TTL, no eviction). - Neo4j embedded auto-grabbed ~50% of free RAM for its off-heap page cache at startup, racing the JVM heap inside a single cgroup. Changes: - Extract `query/TopologySnapshotProvider` as the single owner of the topology snapshot; both McpTools and TopologyController now consume it. 60 s TTL deduplicates concurrent loads and lets idle pods release the heap. The Snapshot record carries a `loaded` flag so the controller can still distinguish "no source available" (404) from "graph is empty" (200), preserving the legacy contract. - Switch `cache.type: simple` → `caffeine` with `maximumSize=1000, expireAfterWrite=5m` in the serving profile; add the Caffeine dependency. - Cap Neo4j page cache at 256 MiB via `GraphDatabaseSettings.pagecache_memory` in Neo4jConfig. - Add `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError` to scripts/aks-launch.sh so the JVM heap is pinned to half the cgroup limit, leaving room for Neo4j page cache + Metaspace + JIT + Tomcat NIO buffers + OS slack. - Add `shared/runbooks/aks-oom-quick-fix.md` with diagnostic commands, the Deployment YAML patch, and the OOMKilled-vs-readiness-flap decision tree. Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %, no more OOMKilled events, idle pod releases topology snapshot after 60 s. All 3706 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps enabled auto-merge (squash) May 4, 2026 17:14

aksOps merged commit d6e34ea into main May 4, 2026
13 checks passed

aksOps deleted the perf/serve-oom-quickwin branch May 4, 2026 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot#118

perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot#118
aksOps merged 1 commit into
mainfrom
perf/serve-oom-quickwin

aksOps commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aksOps commented May 4, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant